Introduction¶
This notebook contains the final report for the fifth phase of the Web Mining project. The end goal of the project is to "explore how artificial intelligence can be applied to extract meaningful entities from vast textual data and establish connections between them". The fifth phase of the project aims to "compile a comprehensive final report covering all aspects of the project, from introduction to results."
Phase 1: Entity Recognition and Linking with Wikipedia API¶
The first phase of the project covers the "setting the scene" steps: defining the topic, extracting data via the Wikipedia API, and storing the scraped pages in a JSON file.
1.1. Identify Topics¶
Goal: Choose specific topics or categories of interest for entity extraction. This could be based on user input, predefined categories, or a mix of both.
For this project I decided to choose a predefined topic: Russian Invasion of Ukraine.
1.2. Wikipedia API Integration¶
Goal: Implement code to interact with the Wikipedia API to fetch relevant pages based on the identified topics. Explore different API endpoints for extracting content (e.g., action=query, prop=extracts).
After examining the Wikipedia API, I found that the most straightforward way to get the pages was to query the relevant titles and capture the page ids of the matching articles, then use the page ids to obtain the article URLs, and finally use those URLs to fetch the page contents. The page contents, URLs, and titles are then saved as JSON for further processing. All of this needs only two Python libraries: requests for making API queries and json for storing data. We also use BeautifulSoup to illustrate parsed pages.
NB: Later, however, I found that it is much easier to access the text of the articles via the wikipediaapi library. So, after phases 1 and 2, I rewrote the previous code as a Python script that contains both data extraction and preparation.
1.2.1. Query page ids¶
{'batchcomplete': '',
'continue': {'sroffset': 3, 'continue': '-||'},
'query': {'searchinfo': {'totalhits': 43810},
'search': [{'ns': 0,
'title': 'Russo-Ukrainian War',
'pageid': 42085878,
'size': 307171,
'wordcount': 24480,
'snippet': 'Russo-<span class="searchmatch">Ukrainian</span> <span class="searchmatch">War</span> is an ongoing <span class="searchmatch">war</span> between <span class="searchmatch">Russia</span> and <span class="searchmatch">Ukraine</span>, which began in February 2014. Following <span class="searchmatch">Ukraine\'s</span> Revolution of Dignity, <span class="searchmatch">Russia</span> occupied',
'timestamp': '2024-04-11T15:32:37Z'},
{'ns': 0,
'title': 'Russian invasion of Ukraine',
'pageid': 70149799,
'size': 389296,
'wordcount': 33997,
'snippet': 'On 24 February 2022, <span class="searchmatch">Russia</span> invaded <span class="searchmatch">Ukraine</span> in an escalation of the Russo-<span class="searchmatch">Ukrainian</span> <span class="searchmatch">War</span> that started in 2014. The invasion became the largest attack on',
'timestamp': '2024-04-11T08:03:38Z'},
{'ns': 0,
'title': 'War crimes in the Russian invasion of Ukraine',
'pageid': 70167888,
'size': 258957,
'wordcount': 23886,
'snippet': 'Since the beginning of the <span class="searchmatch">Russian</span> invasion of <span class="searchmatch">Ukraine</span> in 2022, the <span class="searchmatch">Russian</span> military and authorities have committed <span class="searchmatch">war</span> crimes, such as deliberate attacks',
'timestamp': '2024-04-11T10:53:01Z'}]}}
Above is the result of a successful query, showing the matching pages. I decided to limit the number of pages to three so that it would be easier to test and illustrate the whole pipeline. Later, as the amount of data needed for the project grew, the limit was raised to thirteen.
{'Russo-Ukrainian War': 42085878,
'Russian invasion of Ukraine': 70149799,
'War crimes in the Russian invasion of Ukraine': 70167888}
Above is the intermediate result: a dictionary with page titles as keys and the corresponding page ids as values. Next, we will iterate through this dictionary to get the page URLs.
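The query code itself is not reproduced in this report. A minimal sketch of the search step, assuming the standard MediaWiki `action=query` / `list=search` endpoint with `srlimit=3` (which would be consistent with the `sroffset: 3` in the response above), might look like this. The function names, the `USER_AGENT` string, and the exact search phrase are illustrative assumptions, not the original code.

```python
API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical client identifier; Wikimedia asks API clients to send a descriptive User-Agent.
USER_AGENT = "web-mining-report-example/0.1"


def build_search_params(topic, limit=3):
    """Parameters for a MediaWiki full-text search (list=search)."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": topic,
        "srlimit": limit,
        "format": "json",
    }


def extract_page_ids(response_json):
    """Map each matching title to its page id."""
    return {hit["title"]: hit["pageid"] for hit in response_json["query"]["search"]}


def search_page_ids(topic, limit=3):
    """Run the search over HTTP and return {title: pageid}."""
    import requests  # imported here so the pure helpers above work without it

    response = requests.get(
        API_URL,
        params=build_search_params(topic, limit),
        headers={"User-Agent": USER_AGENT},
        timeout=10,
    )
    response.raise_for_status()
    return extract_page_ids(response.json())
```

Calling `search_page_ids("Russian invasion of Ukraine")` would return a dictionary shaped like the one above, although the titles returned can change as Wikipedia's search index changes.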
1.2.2 Query URLs via page_ids¶
{'Russo-Ukrainian War': 'https://en.wikipedia.org/wiki/Russo-Ukrainian_War',
'Russian invasion of Ukraine': 'https://en.wikipedia.org/wiki/Russian_invasion_of_Ukraine',
'War crimes in the Russian invasion of Ukraine': 'https://en.wikipedia.org/wiki/War_crimes_in_the_Russian_invasion_of_Ukraine'}
After making another API call to the Wikipedia servers, we extract the page URLs via the page ids. Now, with the URLs at hand, we can make an HTTP GET request and obtain the contents of each page.
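One way to resolve page ids to URLs is the `prop=info` query with `inprop=url`, which returns a `fullurl` field for each page. Whether the original notebook used exactly this property is an assumption; the helper names below are hypothetical.

```python
API_URL = "https://en.wikipedia.org/w/api.php"


def build_url_params(page_ids):
    """Parameters asking for the canonical URL of each page id."""
    return {
        "action": "query",
        "prop": "info",
        "inprop": "url",
        "pageids": "|".join(str(pid) for pid in page_ids),
        "format": "json",
    }


def extract_urls(response_json):
    """Map each title to its full URL; the API keys pages by page id."""
    pages = response_json["query"]["pages"]
    return {page["title"]: page["fullurl"] for page in pages.values()}


def fetch_urls(page_ids):
    """Query the API once for all page ids and return {title: url}."""
    import requests  # imported here so the pure helpers above work without it

    response = requests.get(API_URL, params=build_url_params(page_ids), timeout=10)
    response.raise_for_status()
    return extract_urls(response.json())
```

Passing all ids in a single `pageids` parameter avoids one HTTP round trip per article, which matters more once the limit grows from three to thirteen pages.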
1.3. Data Storage¶
Goal: Design a data structure to store the retrieved Wikipedia pages. Consider using a suitable data format, such as JSON or a database, to organize and store the data.
In the beginning I faced a pseudo-dilemma: should I save only the URLs and make new HTTP requests later to get the data, or should I save the contents of each page right away? Fortunately, I decided to save both in a list and then connect this list to a title in a dictionary.
In the end, I saved the data in JSON format. First, I used try-except-else statements to check whether a file containing data already exists. If not, a new file is created and filled with the data from the current session. If there is a file, the old data are loaded and the new data are appended. Finally, the updated data are written back to the file. On top of that, the writing process was executed piece by piece to avoid duplicates. Overall, this approach may create a bottleneck in the future due to the extensive read-write workload. For now, though, it works all right, so the benefit of not producing a pile of JSON files outweighs the possible drawbacks.
Later, I decided that I can access the text via an API call and derive the URL of the Wiki page from its title, so after some revision I changed the aritcles.json file to contain the following information:
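The load-merge-write cycle described above can be sketched as follows. This is an illustration under assumptions, not the original cell: the function name `update_json_store` is hypothetical, and each record is assumed to be a `[url, content]` list keyed by page title.

```python
import json


def update_json_store(path, new_pages):
    """Merge new_pages into the JSON file at path, creating it if needed."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            stored = json.load(f)
    except FileNotFoundError:
        # No file yet: start from the current session's data.
        stored = dict(new_pages)
    else:
        # File exists: add only titles that are not stored already,
        # so repeated runs do not create duplicates.
        for title, record in new_pages.items():
            stored.setdefault(title, record)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stored, f, ensure_ascii=False, indent=2)
    return stored
```

Because the whole file is read and rewritten on every call, the cost grows with the size of the stored data, which is the bottleneck mentioned above.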
{
"Russo-Ukrainian War": "Russo-Ukrainian_War",
"Russian invasion of Ukraine": "Russian_invasion_of_Ukraine",
"War crimes in the Russian invasion of Ukraine": "War_crimes_in_the_Russian_invasion_of_Ukraine",
"Casualties of the Russo-Ukrainian War": "Casualties_of_the_Russo-Ukrainian_War",
"Disinformation in the Russian invasion of Ukraine": "Disinformation_in_the_Russian_invasion_of_Ukraine",
"List of aircraft losses during the Russo-Ukrainian War": "List_of_aircraft_losses_during_the_Russo-Ukrainian_War",
"War in Donbas": "War_in_Donbas",
"List of wars between Russia and Ukraine": "List_of_wars_between_Russia_and_Ukraine",
"Russian-occupied territories of Ukraine": "Russian-occupied_territories_of_Ukraine",
"Russian information war against Ukraine": "Russian_information_war_against_Ukraine",
"Child abductions in the Russo-Ukrainian War": "Child_abductions_in_the_Russo-Ukrainian_War",
"Timeline of the Russian invasion of Ukraine": "Timeline_of_the_Russian_invasion_of_Ukraine",
"Russian war crimes": "Russian_war_crimes"
}
As we can see, it maps each article title to its underscored copy, from which the URL can be derived.
This concludes the first phase of the project.
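Deriving the URL from the stored value is a matter of prefixing the English Wikipedia base path. A small sketch, with hypothetical helper names, assuming the underscored form is simply the title with spaces replaced:

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def title_to_slug(title):
    """Turn a page title into the underscored form stored in the JSON file."""
    return title.replace(" ", "_")


def slug_to_url(slug):
    """Build the article URL, percent-encoding characters that need it."""
    return WIKI_BASE + quote(slug)


print(slug_to_url(title_to_slug("War in Donbas")))
# prints https://en.wikipedia.org/wiki/War_in_Donbas
```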
Phase 2: Preprocessing and Named Entity Recognition¶
The second phase of the project covers the "text preprocessing" steps: tokenization, removing stop words, handling special characters, implementing named entity recognition (NER), entity linking, and exporting the results to a .csv file.
2.1. Text Preprocessing:¶
Goal: Clean and preprocess the text data obtained from Wikipedia pages. Perform tasks like tokenization, removing stop words, and handling special characters.
For tokenization, we use nltk's word_tokenize() function. For removing stop words, we use nltk's list of English stopwords. For handling special characters, we use the re library and its sub() function. The process itself takes just five lines of code to get clean, preprocessed text.
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenization (article_text holds the raw text of one article)
tokens = word_tokenize(article_text)
# Removing stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Handling special characters and lowercasing
cleaned_tokens = [re.sub(r'[^a-zA-Z0-9]', '', word.lower()) for word in filtered_tokens if word.isalnum()]
# Convert list of cleaned tokens back to a string
cleaned_text = ' '.join(cleaned_tokens)
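To see what the cleaning step does to individual tokens, here is a small standard-library sketch that applies the same `isalnum()` filter and regular expression to a hand-written token list. The sample tokens are made up for illustration and are not produced by word_tokenize(); stop-word removal is left out.

```python
import re


def clean_tokens(tokens):
    """Keep alphanumeric tokens, lowercase them, and strip non-ASCII characters."""
    return [re.sub(r'[^a-zA-Z0-9]', '', t.lower()) for t in tokens if t.isalnum()]


sample = ["Encyclopædia", "Britannica", "Russo-Ukrainian", "War", "2014", "."]
print(clean_tokens(sample))
# prints ['encyclopdia', 'britannica', 'war', '2014']
```

Two side effects are visible here: a token containing a hyphen fails `isalnum()` and is dropped as a whole, and a non-ASCII letter such as "æ" passes `isalnum()` but is then deleted by the regular expression, which is how a form like "encyclopdia britannica" can end up among the extracted entities.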
2.2. Named Entity Recognition¶
Goal: Implement NER using libraries like SpaCy or NLTK. Extract entities such as persons, organizations, locations, etc., from the preprocessed text.
I decided to use spaCy for NER. I checked the available labels it captures and chose the most relevant ones for the task. The list of the entity labels can be seen below.
import spacy

valid_entities = ['PERSON','NORP','FAC','ORG','GPE','LOC','PRODUCT','EVENT','WORK_OF_ART','LAW','LANGUAGE']
# Load SpaCy's English language model
nlp = spacy.load("en_core_web_sm")
# Process the preprocessed text
doc = nlp(cleaned_text)
# Extract named entities
entities = [(entity.text, entity.label_) for entity in doc.ents if entity.label_ in valid_entities]
Now, we filter the list of found entities and keep only the unique ones. Below is an example of the filtered entities:
[('russia', 'GPE'), ('ukraine', 'GPE'), ('russian', 'NORP'), ('donbas', 'GPE'), ('donbas war', 'ORG'), ('vladimir putin', 'PERSON'), ('nato', 'ORG'), ('putin', 'PERSON'), ('soviet', 'NORP'), ('kingdom united states', 'GPE'), ('european', 'NORP'), ('eastern bloc', 'LOC'), ('first chechen war', 'EVENT'), ('viktor yushchenko', 'PERSON'), ('supreme court', 'ORG'), ('yulia', 'GPE'), ('anthony cordesman', 'PERSON'), ('favour putin', 'PERSON'), ('georgia', 'GPE'), ('western european', 'NORP'), ('ukraine georgia', 'GPE'), ('us', 'GPE'), ('george bush', 'PERSON'), ('georgia ukraine', 'ORG'), ('eurasian', 'NORP'), ('eu', 'GPE'), ('kremlin', 'ORG'), ('rada', 'GPE'), ('ukraine russia', 'PERSON'), ('crimean', 'GPE'), ('kacha hvardiiske simferopol', 'PERSON'), ('insignia', 'GPE'), ('aksyonov', 'NORP'), ('ukraine putin', 'PERSON'), ('donbas part', 'ORG'), ('chechen', 'NORP'), ('un', 'ORG'), ('eastern ukraine', 'NORP'), ('insignia girkin', 'ORG'), ('malaysia', 'GPE'), ('united banner', 'ORG'), ('nikolai mitrokhin', 'PERSON'), ('novoazovsk azov sea coast', 'FAC'), ('novoazovsk', 'GPE'), ('mariupol un security council', 'ORG'), ('luhansk suffering', 'ORG'), ('defence ministry', 'ORG'), ('luhansk', 'GPE'), ('house', 'ORG'), ('novaya gazeta', 'PERSON'), ('lev shlosberg', 'PERSON'), ('valentina matviyenko', 'PERSON'), ('andrey kelin', 'PERSON'), ('kelin', 'ORG'), ('philip breedlove', 'PERSON'), ('moscow', 'GPE'), ('chicago council global affairs', 'ORG'), ('united nations security council', 'ORG'), ('united kingdom', 'ORG'), ('united nations human rights office estimated 8000', 'ORG'), ('vladimir yefimov', 'PERSON'), ('red cross', 'ORG'), ('konstantin zatulin', 'PERSON'), ('volodymyr zelenskyy', 'PERSON'), ('ukraine separatists', 'PERSON'), ('nato eu', 'FAC'), ('vladimir lenin', 'PERSON'), ('soviet republic putin', 'GPE'), ('nikita khrushchev', 'PERSON'), ('world war ii', 'EVENT'), ('russian christians', 'NORP'), ('jews', 'NORP'), ('nazi germany', 'GPE'), ('jewish', 'NORP'), ('jens 
stoltenberg', 'PERSON'), ('evacuations donbas', 'PERSON'), ('federation council', 'ORG'), ('conscription army', 'ORG'), ('eastern ukraine command', 'ORG'), ('aleksandr dvornikov', 'ORG'), ('europe', 'LOC'), ('yugoslav', 'NORP'), ('russians', 'NORP'), ('east bank', 'GPE'), ('united nations general assembly resolution', 'ORG'), ('cnn', 'ORG'), ('khartoum', 'GPE'), ('budanov', 'PERSON'), ('romanian', 'NORP'), ('nord stream pipeline', 'ORG'), ('yuzivska', 'ORG'), ('arsen avakov', 'PERSON'), ('balkan', 'LOC'), ('germany', 'GPE'), ('joe biden', 'PERSON'), ('german', 'NORP'), ('angela merkel', 'PERSON'), ('nord stream political', 'ORG'), ('poland', 'GPE'), ('nord', 'ORG'), ('ukraine naftogaz', 'GPE'), ('yuriy vitrenko', 'PERSON'), ('united states', 'GPE'), ('communist party', 'ORG'), ('syria', 'GPE'), ('iraq', 'GPE'), ('vladislav surkov', 'PERSON'), ('putin russian', 'PERSON'), ('zelenskyy jewish', 'PERSON'), ('natalia', 'GPE'), ('antonova', 'GPE'), ('world war ii ukraine rejection adoption general assembly resolutions combating', 'EVENT'), ('nazi', 'NORP'), ('dmitry medvedev', 'PERSON'), ('medvedev', 'PERSON'), ('medvedev described', 'PERSON'), ('southern city', 'LOC'), ('american', 'NORP'), ('nafo north atlantic', 'PERSON'), ('cartoon shiba inu dogs social media', 'ORG'), ('mikhail ulyanov', 'PERSON'), ('ukrainians', 'NORP'), ('islamic', 'NORP'), ('crocus', 'ORG'), ('krasnogorsk moscow', 'FAC'), ('russia defense ministry', 'ORG'), ('ukraine russian', 'NORP'), ('church hierarch patriarch', 'ORG'), ('germany university', 'ORG'), ('al jazeera roc', 'ORG'), ('nakaz decree council present', 'ORG'), ('roc protodeacon', 'PERSON'), ('christians', 'NORP'), ('sergey lavrov', 'PERSON'), ('eastern', 'ORG'), ('cia', 'ORG'), ('leon panetta', 'PERSON'), ('abc', 'ORG'), ('oleksandr turchynov', 'PERSON'), ('abkhazia south ossetia', 'ORG'), ('soviet union', 'GPE'), ('battalion ukrainian national guard american forces', 'ORG'), ('pentagon', 'ORG'), ('russian european', 'NORP'), ('swiss 
franc climb high dollar high euro', 'ORG'), ('australian', 'NORP'), ('crimea taken international republican institute', 'ORG'), ('western ukraine', 'NORP'), ('levada', 'ORG'), ('ukraine street', 'GPE'), ('ukraine donbas', 'GPE'), ('congress', 'ORG'), ('afghanistan', 'GPE'), ('washington', 'GPE'), ('timothy snyder', 'PERSON'), ('new york times', 'ORG'), ('revival putin', 'PERSON'), ('putin justification', 'PERSON'), ('christian', 'NORP'), ('putin adolf hitler conquests', 'PERSON'), ('iran', 'GPE'), ('north korea', 'GPE'), ('china', 'GPE'), ('chinese', 'NORP'), ('arab', 'NORP'), ('south africa', 'GPE'), ('south african', 'NORP'), ('united nations', 'ORG'), ('united nations general assembly', 'ORG'), ('nicaragua', 'GPE'), ('encyclopdia britannica', 'FAC'), ('google news war', 'ORG'), ('bbc news', 'ORG')]
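The filtering step that produced the list above is not shown in the report. One simple way to keep only unique entities while preserving the order in which they were found is sketched below; the helper name `unique_in_order` is hypothetical, and using `dict.fromkeys` is an assumption about how the deduplication could be done, not the original code.

```python
def unique_in_order(entities):
    """Drop repeated (text, label) pairs, keeping the first occurrence of each."""
    return list(dict.fromkeys(entities))


entities = [("russia", "GPE"), ("ukraine", "GPE"), ("russia", "GPE"), ("putin", "PERSON")]
filtered_entities = unique_in_order(entities)
print(filtered_entities)
# prints [('russia', 'GPE'), ('ukraine', 'GPE'), ('putin', 'PERSON')]
```

Since the pair of text and label is the key, the same text tagged with two different labels stays in the list twice.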
2.3. Entity Linking¶
Goal: Integrate entity linking to associate recognized entities with their corresponding Wikipedia pages. Use the Wikipedia API to enhance entity information.
We implement this task by defining a function that fetches the Wikipedia article for an entity (if it exists). Then, we add its URL to the list of found entities.
# Function to fetch Wikipedia page for an entity
def fetch_wikipedia_page(entity):
    page = wiki.page(entity)
    if page.exists():
        return page.fullurl
    else:
        return None

# Entity linking
linked_entities = []
for entity_text, entity_label in filtered_entities:
    # Fetch Wikipedia page for the entity
    page_url = fetch_wikipedia_page(entity_text)
    if page_url:
        linked_entities.append((entity_text, entity_label, page_url))
    else:
        linked_entities.append((entity_text, entity_label, None))
Below is an example of the output with linked entities:
[('russia', 'GPE', 'https://en.wikipedia.org/wiki/Russia'), ('ukraine', 'GPE', 'https://en.wikipedia.org/wiki/Ukraine'), ('russian', 'NORP', 'https://en.wikipedia.org/wiki/Russian'), ('donbas', 'GPE', 'https://en.wikipedia.org/wiki/Donbas'), ('donbas war', 'ORG', 'https://en.wikipedia.org/wiki/War_in_Donbas'), ('vladimir putin', 'PERSON', 'https://en.wikipedia.org/wiki/Vladimir_Putin'), ('nato', 'ORG', 'https://en.wikipedia.org/wiki/NATO'), ('putin', 'PERSON', 'https://en.wikipedia.org/wiki/Vladimir_Putin'), ('soviet', 'NORP', 'https://en.wikipedia.org/wiki/Soviet_Union'), ('kingdom united states', 'GPE', None), ('european', 'NORP', 'https://en.wikipedia.org/wiki/European'), ('eastern bloc', 'LOC', 'https://en.wikipedia.org/wiki/Eastern_Bloc'), ('first chechen war', 'EVENT', 'https://en.wikipedia.org/wiki/First_Chechen_War'), ('viktor yushchenko', 'PERSON', None), ('supreme court', 'ORG', 'https://en.wikipedia.org/wiki/Supreme_court'), ('yulia', 'GPE', 'https://en.wikipedia.org/wiki/Yulia'), ('anthony cordesman', 'PERSON', None), ('favour putin', 'PERSON', None), ('georgia', 'GPE', 'https://en.wikipedia.org/wiki/Georgia'), ('western european', 'NORP', 'https://en.wikipedia.org/wiki/Western_Europe'), ('ukraine georgia', 'GPE', None), ('us', 'GPE', 'https://en.wikipedia.org/wiki/Us'), ('george bush', 'PERSON', 'https://en.wikipedia.org/wiki/George_Bush'), ('georgia ukraine', 'ORG', None), ('eurasian', 'NORP', 'https://en.wikipedia.org/wiki/Eurasia'), ('eu', 'GPE', 'https://en.wikipedia.org/wiki/European_Union'), ('kremlin', 'ORG', 'https://en.wikipedia.org/wiki/Kremlin'), ('rada', 'GPE', 'https://en.wikipedia.org/wiki/Rada'), ('ukraine russia', 'PERSON', None), ('crimean', 'GPE', 'https://en.wikipedia.org/wiki/Crimea'), ('kacha hvardiiske simferopol', 'PERSON', None), ('insignia', 'GPE', 'https://en.wikipedia.org/wiki/Insignia'), ('aksyonov', 'NORP', 'https://en.wikipedia.org/wiki/Aksyonov'), ('ukraine putin', 'PERSON', None), ('donbas part', 'ORG', None), 
('chechen', 'NORP', 'https://en.wikipedia.org/wiki/Chechen'), ('un', 'ORG', 'https://en.wikipedia.org/wiki/UN_(disambiguation)'), ('eastern ukraine', 'NORP', None), ('insignia girkin', 'ORG', None), ('malaysia', 'GPE', 'https://en.wikipedia.org/wiki/Malaysia'), ('united banner', 'ORG', None), ('nikolai mitrokhin', 'PERSON', None), ('novoazovsk azov sea coast', 'FAC', None), ('novoazovsk', 'GPE', 'https://en.wikipedia.org/wiki/Novoazovsk'), ('mariupol un security council', 'ORG', None), ('luhansk suffering', 'ORG', None), ('defence ministry', 'ORG', 'https://en.wikipedia.org/wiki/Ministry_of_defence'), ('luhansk', 'GPE', 'https://en.wikipedia.org/wiki/Luhansk'), ('house', 'ORG', 'https://en.wikipedia.org/wiki/House'), ('novaya gazeta', 'PERSON', 'https://en.wikipedia.org/wiki/Novaya_Gazeta'), ('lev shlosberg', 'PERSON', None), ('valentina matviyenko', 'PERSON', None), ('andrey kelin', 'PERSON', None), ('kelin', 'ORG', 'https://en.wikipedia.org/wiki/Kelin'), ('philip breedlove', 'PERSON', None), ('moscow', 'GPE', 'https://en.wikipedia.org/wiki/Moscow'), ('chicago council global affairs', 'ORG', None), ('united nations security council', 'ORG', None), ('united kingdom', 'ORG', 'https://en.wikipedia.org/wiki/United_Kingdom'), ('united nations human rights office estimated 8000', 'ORG', None), ('vladimir yefimov', 'PERSON', None), ('red cross', 'ORG', 'https://en.wikipedia.org/wiki/International_Red_Cross_and_Red_Crescent_Movement'), ('konstantin zatulin', 'PERSON', None), ('volodymyr zelenskyy', 'PERSON', 'https://en.wikipedia.org/wiki/Volodymyr_Zelenskyy'), ('ukraine separatists', 'PERSON', None), ('nato eu', 'FAC', None), ('vladimir lenin', 'PERSON', 'https://en.wikipedia.org/wiki/Vladimir_Lenin'), ('soviet republic putin', 'GPE', None), ('nikita khrushchev', 'PERSON', 'https://en.wikipedia.org/wiki/Nikita_Khrushchev'), ('world war ii', 'EVENT', 'https://en.wikipedia.org/wiki/World_War_II'), ('russian christians', 'NORP', None), ('jews', 'NORP', 
'https://en.wikipedia.org/wiki/Jews'), ('nazi germany', 'GPE', 'https://en.wikipedia.org/wiki/Nazi_Germany'), ('jewish', 'NORP', 'https://en.wikipedia.org/wiki/Jews'), ('jens stoltenberg', 'PERSON', None), ('evacuations donbas', 'PERSON', None), ('federation council', 'ORG', None), ('conscription army', 'ORG', None), ('eastern ukraine command', 'ORG', None), ('aleksandr dvornikov', 'ORG', None), ('europe', 'LOC', 'https://en.wikipedia.org/wiki/Europe'), ('yugoslav', 'NORP', 'https://en.wikipedia.org/wiki/Yugoslav'), ('russians', 'NORP', 'https://en.wikipedia.org/wiki/Russians'), ('east bank', 'GPE', None), ('united nations general assembly resolution', 'ORG', None), ('cnn', 'ORG', 'https://en.wikipedia.org/wiki/CNN'), ('khartoum', 'GPE', 'https://en.wikipedia.org/wiki/Khartoum'), ('budanov', 'PERSON', 'https://en.wikipedia.org/wiki/Budanov'), ('romanian', 'NORP', 'https://en.wikipedia.org/wiki/Romanian'), ('nord stream pipeline', 'ORG', None), ('yuzivska', 'ORG', None), ('arsen avakov', 'PERSON', None), ('balkan', 'LOC', 'https://en.wikipedia.org/wiki/Balkans'), ('germany', 'GPE', 'https://en.wikipedia.org/wiki/Germany'), ('joe biden', 'PERSON', 'https://en.wikipedia.org/wiki/Joe_Biden'), ('german', 'NORP', 'https://en.wikipedia.org/wiki/German'), ('angela merkel', 'PERSON', 'https://en.wikipedia.org/wiki/Angela_Merkel'), ('nord stream political', 'ORG', None), ('poland', 'GPE', 'https://en.wikipedia.org/wiki/Poland'), ('nord', 'ORG', 'https://en.wikipedia.org/wiki/Nord'), ('ukraine naftogaz', 'GPE', None), ('yuriy vitrenko', 'PERSON', None), ('united states', 'GPE', 'https://en.wikipedia.org/wiki/United_States'), ('communist party', 'ORG', 'https://en.wikipedia.org/wiki/Communist_party'), ('syria', 'GPE', 'https://en.wikipedia.org/wiki/Syria'), ('iraq', 'GPE', 'https://en.wikipedia.org/wiki/Iraq'), ('vladislav surkov', 'PERSON', 'https://en.wikipedia.org/wiki/Vladislav_Surkov'), ('putin russian', 'PERSON', None), ('zelenskyy jewish', 'PERSON', None), ('natalia', 
'GPE', 'https://en.wikipedia.org/wiki/Natalia'), ('antonova', 'GPE', 'https://en.wikipedia.org/wiki/Antonova'), ('world war ii ukraine rejection adoption general assembly resolutions combating', 'EVENT', None), ('nazi', 'NORP', 'https://en.wikipedia.org/wiki/Nazism'), ('dmitry medvedev', 'PERSON', None), ('medvedev', 'PERSON', 'https://en.wikipedia.org/wiki/Medvedev'), ('medvedev described', 'PERSON', None), ('southern city', 'LOC', None), ('american', 'NORP', 'https://en.wikipedia.org/wiki/American'), ('nafo north atlantic', 'PERSON', None), ('cartoon shiba inu dogs social media', 'ORG', None), ('mikhail ulyanov', 'PERSON', None), ('ukrainians', 'NORP', 'https://en.wikipedia.org/wiki/Ukrainians'), ('islamic', 'NORP', 'https://en.wikipedia.org/wiki/Islam'), ('crocus', 'ORG', 'https://en.wikipedia.org/wiki/Crocus'), ('krasnogorsk moscow', 'FAC', None), ('russia defense ministry', 'ORG', None), ('ukraine russian', 'NORP', None), ('church hierarch patriarch', 'ORG', None), ('germany university', 'ORG', 'https://en.wikipedia.org/wiki/List_of_universities_in_Germany'), ('al jazeera roc', 'ORG', None), ('nakaz decree council present', 'ORG', None), ('roc protodeacon', 'PERSON', None), ('christians', 'NORP', 'https://en.wikipedia.org/wiki/Christians'), ('sergey lavrov', 'PERSON', None), ('eastern', 'ORG', 'https://en.wikipedia.org/wiki/Eastern'), ('cia', 'ORG', 'https://en.wikipedia.org/wiki/Central_Intelligence_Agency'), ('leon panetta', 'PERSON', None), ('abc', 'ORG', 'https://en.wikipedia.org/wiki/ABC'), ('oleksandr turchynov', 'PERSON', None), ('abkhazia south ossetia', 'ORG', None), ('soviet union', 'GPE', 'https://en.wikipedia.org/wiki/Soviet_Union'), ('battalion ukrainian national guard american forces', 'ORG', None), ('pentagon', 'ORG', 'https://en.wikipedia.org/wiki/Pentagon'), ('russian european', 'NORP', None), ('swiss franc climb high dollar high euro', 'ORG', None), ('australian', 'NORP', 'https://en.wikipedia.org/wiki/Australian'), ('crimea taken 
international republican institute', 'ORG', None), ('western ukraine', 'NORP', None), ('levada', 'ORG', 'https://en.wikipedia.org/wiki/Levada'), ('ukraine street', 'GPE', None), ('ukraine donbas', 'GPE', None), ('congress', 'ORG', 'https://en.wikipedia.org/wiki/Congress'), ('afghanistan', 'GPE', 'https://en.wikipedia.org/wiki/Afghanistan'), ('washington', 'GPE', 'https://en.wikipedia.org/wiki/Washington'), ('timothy snyder', 'PERSON', None), ('new york times', 'ORG', 'https://en.wikipedia.org/wiki/The_New_York_Times'), ('revival putin', 'PERSON', None), ('putin justification', 'PERSON', None), ('christian', 'NORP', 'https://en.wikipedia.org/wiki/Christians'), ('putin adolf hitler conquests', 'PERSON', None), ('iran', 'GPE', 'https://en.wikipedia.org/wiki/Iran'), ('north korea', 'GPE', 'https://en.wikipedia.org/wiki/North_Korea'), ('china', 'GPE', 'https://en.wikipedia.org/wiki/China'), ('chinese', 'NORP', 'https://en.wikipedia.org/wiki/Chinese'), ('arab', 'NORP', 'https://en.wikipedia.org/wiki/Arabs'), ('south africa', 'GPE', 'https://en.wikipedia.org/wiki/South_Africa'), ('south african', 'NORP', None), ('united nations', 'ORG', 'https://en.wikipedia.org/wiki/United_Nations'), ('united nations general assembly', 'ORG', None), ('nicaragua', 'GPE', 'https://en.wikipedia.org/wiki/Nicaragua'), ('encyclopdia britannica', 'FAC', None), ('google news war', 'ORG', None), ('bbc news', 'ORG', 'https://en.wikipedia.org/wiki/BBC_News')]
Finally, we create a pandas DataFrame and drop the entities with no URLs. We export the resulting table to a .csv file.
| | Entity | Label | URL |
|---|---|---|---|
| 0 | russia | GPE | https://en.wikipedia.org/wiki/Russia |
| 1 | ukraine | GPE | https://en.wikipedia.org/wiki/Ukraine |
| 2 | russian | NORP | https://en.wikipedia.org/wiki/Russian |
| 3 | donbas | GPE | https://en.wikipedia.org/wiki/Donbas |
| 4 | donbas war | ORG | https://en.wikipedia.org/wiki/War_in_Donbas |
| 5 | vladimir putin | PERSON | https://en.wikipedia.org/wiki/Vladimir_Putin |
| 6 | nato | ORG | https://en.wikipedia.org/wiki/NATO |
| 7 | putin | PERSON | https://en.wikipedia.org/wiki/Vladimir_Putin |
| 8 | soviet | NORP | https://en.wikipedia.org/wiki/Soviet_Union |
| 10 | european | NORP | https://en.wikipedia.org/wiki/European |
| 11 | eastern bloc | LOC | https://en.wikipedia.org/wiki/Eastern_Bloc |
| 12 | first chechen war | EVENT | https://en.wikipedia.org/wiki/First_Chechen_War |
| 14 | supreme court | ORG | https://en.wikipedia.org/wiki/Supreme_court |
| 15 | yulia | GPE | https://en.wikipedia.org/wiki/Yulia |
| 18 | georgia | GPE | https://en.wikipedia.org/wiki/Georgia |
| 19 | western european | NORP | https://en.wikipedia.org/wiki/Western_Europe |
| 21 | us | GPE | https://en.wikipedia.org/wiki/Us |
| 22 | george bush | PERSON | https://en.wikipedia.org/wiki/George_Bush |
| 24 | eurasian | NORP | https://en.wikipedia.org/wiki/Eurasia |
| 25 | eu | GPE | https://en.wikipedia.org/wiki/European_Union |
| 26 | kremlin | ORG | https://en.wikipedia.org/wiki/Kremlin |
| 27 | rada | GPE | https://en.wikipedia.org/wiki/Rada |
| 29 | crimean | GPE | https://en.wikipedia.org/wiki/Crimea |
| 31 | insignia | GPE | https://en.wikipedia.org/wiki/Insignia |
| 32 | aksyonov | NORP | https://en.wikipedia.org/wiki/Aksyonov |
| 35 | chechen | NORP | https://en.wikipedia.org/wiki/Chechen |
| 36 | un | ORG | https://en.wikipedia.org/wiki/UN_(disambiguation) |
| 39 | malaysia | GPE | https://en.wikipedia.org/wiki/Malaysia |
| 43 | novoazovsk | GPE | https://en.wikipedia.org/wiki/Novoazovsk |
| 46 | defence ministry | ORG | https://en.wikipedia.org/wiki/Ministry_of_defence |
| 47 | luhansk | GPE | https://en.wikipedia.org/wiki/Luhansk |
| 48 | house | ORG | https://en.wikipedia.org/wiki/House |
| 49 | novaya gazeta | PERSON | https://en.wikipedia.org/wiki/Novaya_Gazeta |
| 53 | kelin | ORG | https://en.wikipedia.org/wiki/Kelin |
| 55 | moscow | GPE | https://en.wikipedia.org/wiki/Moscow |
| 58 | united kingdom | ORG | https://en.wikipedia.org/wiki/United_Kingdom |
| 61 | red cross | ORG | https://en.wikipedia.org/wiki/International_Re... |
| 63 | volodymyr zelenskyy | PERSON | https://en.wikipedia.org/wiki/Volodymyr_Zelenskyy |
| 66 | vladimir lenin | PERSON | https://en.wikipedia.org/wiki/Vladimir_Lenin |
| 68 | nikita khrushchev | PERSON | https://en.wikipedia.org/wiki/Nikita_Khrushchev |
| 69 | world war ii | EVENT | https://en.wikipedia.org/wiki/World_War_II |
| 71 | jews | NORP | https://en.wikipedia.org/wiki/Jews |
| 72 | nazi germany | GPE | https://en.wikipedia.org/wiki/Nazi_Germany |
| 73 | jewish | NORP | https://en.wikipedia.org/wiki/Jews |
| 80 | europe | LOC | https://en.wikipedia.org/wiki/Europe |
| 81 | yugoslav | NORP | https://en.wikipedia.org/wiki/Yugoslav |
| 82 | russians | NORP | https://en.wikipedia.org/wiki/Russians |
| 85 | cnn | ORG | https://en.wikipedia.org/wiki/CNN |
| 86 | khartoum | GPE | https://en.wikipedia.org/wiki/Khartoum |
| 87 | budanov | PERSON | https://en.wikipedia.org/wiki/Budanov |
| 88 | romanian | NORP | https://en.wikipedia.org/wiki/Romanian |
| 92 | balkan | LOC | https://en.wikipedia.org/wiki/Balkans |
| 93 | germany | GPE | https://en.wikipedia.org/wiki/Germany |
| 94 | joe biden | PERSON | https://en.wikipedia.org/wiki/Joe_Biden |
| 95 | german | NORP | https://en.wikipedia.org/wiki/German |
| 96 | angela merkel | PERSON | https://en.wikipedia.org/wiki/Angela_Merkel |
| 98 | poland | GPE | https://en.wikipedia.org/wiki/Poland |
| 99 | nord | ORG | https://en.wikipedia.org/wiki/Nord |
| 102 | united states | GPE | https://en.wikipedia.org/wiki/United_States |
| 103 | communist party | ORG | https://en.wikipedia.org/wiki/Communist_party |
| 104 | syria | GPE | https://en.wikipedia.org/wiki/Syria |
| 105 | iraq | GPE | https://en.wikipedia.org/wiki/Iraq |
| 106 | vladislav surkov | PERSON | https://en.wikipedia.org/wiki/Vladislav_Surkov |
| 109 | natalia | GPE | https://en.wikipedia.org/wiki/Natalia |
| 110 | antonova | GPE | https://en.wikipedia.org/wiki/Antonova |
| 112 | nazi | NORP | https://en.wikipedia.org/wiki/Nazism |
| 114 | medvedev | PERSON | https://en.wikipedia.org/wiki/Medvedev |
| 117 | american | NORP | https://en.wikipedia.org/wiki/American |
| 121 | ukrainians | NORP | https://en.wikipedia.org/wiki/Ukrainians |
| 122 | islamic | NORP | https://en.wikipedia.org/wiki/Islam |
| 123 | crocus | ORG | https://en.wikipedia.org/wiki/Crocus |
| 128 | germany university | ORG | https://en.wikipedia.org/wiki/List_of_universi... |
| 132 | christians | NORP | https://en.wikipedia.org/wiki/Christians |
| 134 | eastern | ORG | https://en.wikipedia.org/wiki/Eastern |
| 135 | cia | ORG | https://en.wikipedia.org/wiki/Central_Intellig... |
| 137 | abc | ORG | https://en.wikipedia.org/wiki/ABC |
| 140 | soviet union | GPE | https://en.wikipedia.org/wiki/Soviet_Union |
| 142 | pentagon | ORG | https://en.wikipedia.org/wiki/Pentagon |
| 145 | australian | NORP | https://en.wikipedia.org/wiki/Australian |
| 148 | levada | ORG | https://en.wikipedia.org/wiki/Levada |
| 151 | congress | ORG | https://en.wikipedia.org/wiki/Congress |
| 152 | afghanistan | GPE | https://en.wikipedia.org/wiki/Afghanistan |
| 153 | washington | GPE | https://en.wikipedia.org/wiki/Washington |
| 155 | new york times | ORG | https://en.wikipedia.org/wiki/The_New_York_Times |
| 158 | christian | NORP | https://en.wikipedia.org/wiki/Christians |
| 160 | iran | GPE | https://en.wikipedia.org/wiki/Iran |
| 161 | north korea | GPE | https://en.wikipedia.org/wiki/North_Korea |
| 162 | china | GPE | https://en.wikipedia.org/wiki/China |
| 163 | chinese | NORP | https://en.wikipedia.org/wiki/Chinese |
| 164 | arab | NORP | https://en.wikipedia.org/wiki/Arabs |
| 165 | south africa | GPE | https://en.wikipedia.org/wiki/South_Africa |
| 167 | united nations | ORG | https://en.wikipedia.org/wiki/United_Nations |
| 169 | nicaragua | GPE | https://en.wikipedia.org/wiki/Nicaragua |
| 172 | bbc news | ORG | https://en.wikipedia.org/wiki/BBC_News |
This concludes the second phase of the project.
Phase 3: Knowledge Graph Construction¶
The third phase of the project covers the "graph visualization" steps: graph representation, entity relationship analysis and, finally, graph visualization.
3.1. Graph Representation¶
Goal: Choose a suitable representation for the knowledge graph, such as an adjacency list or adjacency matrix. Define nodes and edges based on entities and relationships.
The first and second phases were developed in Jupyter notebooks, with the whole process built around a single article. I needed more data to construct a knowledge graph and perform entity relationship analysis, so I rewrote the whole process as a Python script that essentially puts the previous code in "production" and extracts entities for 13 articles on the topic of the Russian-Ukrainian war. The result is a dataset with entities per article (extract):
| | Entity | Label | Article |
|---|---|---|---|
| 0 | russia | GPE | Russo-Ukrainian War |
| 1 | ukraine | GPE | Russo-Ukrainian War |
| 2 | russian | NORP | Russo-Ukrainian War |
| 3 | donbas | GPE | Russo-Ukrainian War |
| 4 | donbas war | ORG | Russo-Ukrainian War |
Below is the number of entities per article. At this point, we can already expect the graph to be huge, because many of the connections overlap.
Article (No. of entities):
Casualties of the Russo-Ukrainian War 56
Child abductions in the Russo-Ukrainian War 73
Disinformation in the Russian invasion of Ukraine 143
List of aircraft losses during the Russo-Ukrainian War 8
List of wars between Russia and Ukraine 11
Russian information war against Ukraine 211
Russian invasion of Ukraine 245
Russian war crimes 128
Russian-occupied territories of Ukraine 52
Russo-Ukrainian War 175
Timeline of the Russian invasion of Ukraine 12
War crimes in the Russian invasion of Ukraine 166
War in Donbas 226
3.2. Entity Relationship Analysis¶
Goal: Establish relationships between entities based on co-occurrence in the same article, references, or other criteria. Determine the weight or strength of relationships.
I established relationships between entities based on co-occurrence in the same article. To do this, I wrote a nested for loop that visits each pair of entities within an article and stores the pair as an edge in a networkx graph object. Each co-occurrence increases the weight of the corresponding edge by one, so the weight of an edge equals the number of articles in which both entities appear.
import networkx as nx

# `df` is the entity dataset shown above. `grouped_trimmed` is assumed to hold
# one list of entities per article (its construction is not shown here).

# Create a graph object
G = nx.Graph()

# Add nodes (unique entities)
for entity in df['Entity'].unique():
    G.add_node(entity)

# Iterate over each group to form edges
for entities in grouped_trimmed:
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            if G.has_edge(entities[i], entities[j]):
                # Increase weight if edge already exists
                G[entities[i]][entities[j]]['weight'] += 1
            else:
                # Add new edge with weight 1
                G.add_edge(entities[i], entities[j], weight=1)
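To make the weighting rule concrete, here is a minimal, self-contained sketch on toy data. The article names, the entities, and the `groups` variable are illustrative assumptions rather than the project's actual data, and `itertools.combinations` is swapped in for the index-based double loop because it yields each unordered pair exactly once.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

# Toy data in the same shape as the entity dataset (values are illustrative only)
df = pd.DataFrame(
    {
        "Entity": ["russia", "ukraine", "donbas", "russia", "ukraine", "crimea"],
        "Article": ["A", "A", "A", "B", "B", "B"],
    }
)

# One sorted list of unique entities per article
# (hypothetical stand-in for `grouped_trimmed`)
groups = df.groupby("Article")["Entity"].apply(lambda s: sorted(set(s))).tolist()

G = nx.Graph()
G.add_nodes_from(df["Entity"].unique())

for entities in groups:
    # combinations() yields each unordered pair of entities exactly once
    for u, v in combinations(entities, 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1
        else:
            G.add_edge(u, v, weight=1)

# 'russia' and 'ukraine' co-occur in both toy articles
print(G["russia"]["ukraine"]["weight"])
```

Deduplicating the entities of each article before pairing them also avoids self-loops, which the index-based loop would create if the same entity appeared twice in one list.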
3.3. Graph Visualization¶
Goal: Utilize graph visualization tools (e.g., NetworkX, Gephi) to visualize the constructed knowledge graph. Experiment with different layouts and configurations.
This was the most challenging part of the project so far. The resulting graph has over 250 vertices, and most of them are connected to one another. The number of possible edges grows quadratically with the number of vertices (a complete graph on 250 vertices has 250 × 249 / 2 = 31,125 edges), so a plain drawing quickly becomes unreadable. To solve this problem, I decided to explore two options:
- visualize a subgraph with a smaller number of nodes and edges (using the trimmed dataset);
- visualize the graph without edges, using placement as an indication of interconnectedness (using the full dataset).
For the first option, I picked pyvis for its additional interactivity. For the second, I decided to stick with Plotly. Below is the implementation of both approaches.
NB: pyvis did not render properly in either VS Code or Google Colab, so I switched to Plotly.
Later, I found out how to leverage networkx for better visualization. I implemented this in phase four.
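The report does not spell out how the trimmed subgraph for the first option was produced. One possible approach, shown here only as a hedged sketch, is to keep the edges whose co-occurrence weight reaches a threshold. The helper name `trim_by_weight` and the toy graph are hypothetical.

```python
import networkx as nx


def trim_by_weight(G, min_weight):
    """Return a new graph with only the edges of G whose weight >= min_weight.

    Nodes that lose all their edges are dropped as well.
    """
    H = nx.Graph()
    H.add_edges_from(
        (u, v, data)
        for u, v, data in G.edges(data=True)
        if data.get("weight", 1) >= min_weight
    )
    return H


# Toy graph (illustrative values only)
G = nx.Graph()
G.add_edge("russia", "ukraine", weight=3)
G.add_edge("russia", "donbas", weight=1)
G.add_edge("ukraine", "kyiv", weight=2)

H = trim_by_weight(G, min_weight=2)
print(sorted(H.nodes()))
```

Raising the threshold shrinks the graph quickly, so it can be tuned until the interactive view stays responsive.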
This concludes the third phase of the project.
Phase 4: Analysis and Interpretation¶
The fourth phase of the project covers all of "graph analysis" parts, including centrality and cluster analysis.
We will start by visualizing our graph properly. Fading the edges by lowering their alpha (opacity) and setting an explicit edge color lets us get a proper look at the graph.
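A minimal sketch of this kind of plot, assuming a spring layout and a random stand-in graph. The report's own `G`, its exact alpha, and its sizes and colors are not reproduced here; every value below is an illustrative assumption.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import networkx as nx

# Random stand-in graph; the report's G would be used instead
G = nx.gnp_random_graph(100, 0.3, seed=1)

# Fixed seed so the layout is reproducible between runs
pos = nx.spring_layout(G, seed=42)

fig, ax = plt.subplots(figsize=(8, 8))
nx.draw_networkx_nodes(G, pos, node_size=20, node_color="tab:blue", ax=ax)
# A low alpha makes single edges faint, so only dense bundles appear dark
nx.draw_networkx_edges(G, pos, alpha=0.05, edge_color="black", ax=ax)

# "Zooming in" is just a matter of narrowing the axis limits around the center
ax.set_xlim(-0.3, 0.3)
ax.set_ylim(-0.3, 0.3)
ax.set_axis_off()
```

In a notebook, `plt.show()` (or `fig.savefig(...)`) would display or store the figure; the alpha has to be tuned to the edge count, since denser graphs need lower values.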
Here we see the core of the graph with several outlier groups. Let's zoom in to take a look at the core without the outliers.
We see that the core itself looks like a conglomerate of several groups. Since we are using a spring layout, the most interconnected nodes are pulled towards each other and end up closer to the center. Let's zoom in further to take a look at the center of the core.
With alpha set to 0.0025, an extremely low opacity, a single edge is practically invisible, so the dark black edges at the center can only mean that these nodes are extremely interconnected. Given that we built the graph from the co-occurrence of named entities, these nodes probably occur in every source. They should have high centrality scores and form the center of the discourse arising from our sources.
4.1. Centrality Analysis¶
Goal: Identify central entities within the knowledge graph using metrics like degree centrality or betweenness centrality.
Thanks to the networkx library, we can compute all the necessary metrics in just a few lines of code:
# Calculate degree and betweenness centralities and store their values in lists.
degree_centrality = nx.degree_centrality(G)
centrality_values = list(degree_centrality.values())
betweenness_centrality = nx.betweenness_centrality(G)
btw_centrality_values = list(betweenness_centrality.values())
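To inspect which nodes rank highest, the centrality dictionaries can be sorted by value. The sketch below uses a toy graph and a hypothetical helper `top_k`; neither is part of the original notebook.

```python
import networkx as nx

# Toy graph (illustrative edges only)
G = nx.Graph()
G.add_edges_from(
    [
        ("russia", "ukraine"),
        ("russia", "putin"),
        ("russia", "un"),
        ("ukraine", "un"),
        ("un", "bbc news"),
    ]
)

degree_centrality = nx.degree_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)


def top_k(scores, k=3):
    """Return the k (node, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


print(top_k(degree_centrality, k=2))
print(top_k(betweenness_centrality, k=2))
```

Note that both calls above ignore edge weights. `nx.betweenness_centrality` accepts a `weight` argument, but it interprets the values as distances, so co-occurrence counts would have to be inverted before being passed in.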
4.1.1. Degree Centrality¶
As I said before, the spring layout places the most interconnected nodes in the center. Let's zoom in and inspect the core.
Indeed, the nodes with the highest degree centrality are clustered at the very center of the graph. Let's see what these nodes are.
We see that the center of the discourse consists of a vocabulary that includes names of states, nations, cities, and continents ('ukraine', 'russia', 'russian', 'united states', 'washington', 'chechen', 'europe', 'european'), international organisations ('united nations', 'un'), and even news outlets ('bbc news'). Strikingly, the only person among the most interconnected nodes is Vladimir Putin ('putin'). The list contains no leaders of Ukraine, the United States, or EU member states. Let's take a look at betweenness centrality.
4.1.2. Betweenness Centrality¶
The list of top-ranked nodes is the same, save for 'american', which is replaced by 'soviet'.
4.2. Cluster Analysis¶
Goal: Group entities into clusters based on their relationships, helping to identify thematic groups.
Let's look for clusters by applying clustering algorithms to the adjacency matrix. We start with the classic k-means.
First we have to choose an appropriate k. For this, I created a plot of the inertia and silhouette scores versus k. Because the silhouette score does not indicate a clear winner, I decided to stick to the 'elbow rule' and chose k=10 based on the inertia-vs-k chart.
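The k-selection step can be sketched as follows, assuming scikit-learn's `KMeans` and `silhouette_score` and a synthetic stand-in graph. The range of k, the seeds, and the graph itself are assumptions made for illustration, not the settings used in the report.

```python
import networkx as nx
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in graph: 3 planted groups of 20 nodes each
G = nx.planted_partition_graph(3, 20, 0.8, 0.05, seed=0)

# Each row of the adjacency matrix is the feature vector of one node
X = nx.to_numpy_array(G)

ks = list(range(2, 8))
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances
    silhouettes.append(silhouette_score(X, km.labels_))

for k, inertia, sil in zip(ks, inertias, silhouettes):
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Plotting `inertias` and `silhouettes` against `ks` gives the two curves discussed above: the "elbow" is the k after which inertia stops dropping sharply, while a higher silhouette indicates better-separated clusters.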
The results are good for the 'outer core' cluster, but the center is split between different clusters. Let's apply density-based algorithms instead. Below is the application of DBSCAN, HDBSCAN, and OPTICS to the graph.
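The report does not state which features were passed to the density-based algorithms. The sketch below assumes the 2D spring-layout coordinates of the nodes; the adjacency rows used for k-means could be substituted. The graph, `eps`, and `min_samples` are illustrative assumptions that would need tuning on the real data.

```python
import networkx as nx
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in graph: 3 planted groups of 20 nodes each
G = nx.planted_partition_graph(3, 20, 0.8, 0.02, seed=0)

# Assumed feature space: node positions from a reproducible spring layout
pos = nx.spring_layout(G, seed=0)
nodes = list(G.nodes())
coords = np.array([pos[n] for n in nodes])

# DBSCAN groups points that lie in dense regions; label -1 marks noise
labels = DBSCAN(eps=0.15, min_samples=4).fit_predict(coords)

# Collect the nodes belonging to each cluster label
clusters = {}
for node, label in zip(nodes, labels):
    clusters.setdefault(int(label), []).append(node)

for label, members in sorted(clusters.items()):
    print(label, len(members))
```

Unlike k-means, DBSCAN does not need the number of clusters in advance, which suits a layout with one dense core and several outlying groups.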
For me, the winner is DBSCAN, because it produced a clear central core. Now, let's look at the individual clusters with labels. Below is a technical plot with a grid and the cluster centers, which I identified experimentally, marked as red dots.
Now, we will iterate over groups and see what entities are linked together.
Here we can see several things:
- Our NER model had a quirk: it extracted not only the actual named entities but sometimes also combinations of those entities. As a result, we see 'putin' in almost every cluster, although these additional mentions frequently do not add much new information, and the graph looks saturated with repetitions. Future work therefore requires more sophisticated filtering.
- Except for the center of the graph depicted in Fig. 1, which represents the "center" of the discourse, and the last figure (Fig. 12), which clearly refers to Ukrainian historical events, all the other groups look like microcosms with similar vocabulary. The clusters are not grouped by thematic similarity. This may be because each cluster represents a particular source. For a clearer interpretation, further inspection of the sources is needed.
This concludes the fourth phase of the project.
Illia Nesterenko,
CSAI-12